A Minimum Description Length Approach to Multitask Feature Selection
Abstract
One of the central problems in statistics and machine learning is regression: given values of input variables, called features, develop a model for an output variable, called a response or task. In many settings there are potentially thousands of possible features, so feature selection is required to reduce the number of predictors used in the model. Feature selection can be interpreted in two broad ways. First, it can be viewed as a means of reducing prediction error on unseen test data by improving model generalization. This is largely the focus within the machine-learning community, where the primary goal is to train a highly accurate system. The second approach to feature selection, often of more interest to scientists, is as a form of hypothesis testing: assuming a “true” model that generates the data from a small number of features, determine which features actually belong in the model. Here the metrics of interest are precision and recall, more than test-set error.

Many regression problems involve not one but several response variables. Often the responses are suspected to share a common underlying structure, in which case it may be advantageous to share information across the responses; this is known as multitask learning. As a special case, we can use multiple responses to better identify shared predictive features, a project we might call multitask feature selection.

This thesis is organized as follows. Section 1 introduces feature selection for regression, focusing on ℓ0 regularization methods and their interpretation within a Minimum Description Length (MDL) framework. Section 2 proposes a novel extension of MDL feature selection to the multitask setting. The approach, called the “Multiple Inclusion Criterion” (MIC), is designed to borrow information across regression tasks by more easily selecting features that are associated with multiple responses. We show in experiments on synthetic and real biological data sets that MIC can reduce prediction error in settings where features are at least partially shared across responses. Section 3 surveys hypothesis testing by regression with a single response, focusing on the parallel between the standard Bonferroni correction and an MDL approach. Mirroring the ideas in Section 2, Section 4 proposes a novel MIC approach to hypothesis testing with multiple responses and shows that on synthetic data with significant sharing of features across responses, MIC outperforms standard FDR-controlling methods in terms of finding true positives for a given level of false positives. Section 5 concludes.

1 Feature Selection with a Single Response

The standard linear regression model assumes that a response y is generated as a linear combination of m predictor variables (“features”) x1, . . ., xm with some random noise:

    y = β1x1 + β2x2 + . . . + βmxm + ε,    ε ∼ N(0, σ²),    (1)

where we assume the first feature x1 is an intercept whose value is always 1. Given n observations of the features and responses, we can write the y values in an n × 1 vector Y and the x values in an n × m matrix X. Assuming the observations are independent and identically distributed, (1) can be rewritten as

    Y = Xβ + ε,    ε ∼ Nn(0, σ²In×n),    (2)

where β = (β1, . . ., βm)′, Nn denotes the n-dimensional Gaussian distribution, and In×n is the n × n identity matrix. The maximum-likelihood estimate β̂ for β under this model can be shown to be the one minimizing the residual sum of squares:

    RSS := (Y − Xβ)′(Y − Xβ).    (3)

The solution is given by

    β̂ = (X′X)⁻¹X′Y    (4)

and is called the ordinary least squares (OLS) estimate for β.
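To make the algebra concrete, here is a minimal NumPy sketch (an illustration, not code from the thesis) that simulates data from model (2) and computes the OLS estimate (4); the sample size, noise level, and true coefficients are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 100, 5, 0.5

# Design matrix with an intercept column of ones, as in (1).
X = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])
beta_true = np.array([1.0, 2.0, 0.0, -1.5, 0.0])
Y = X @ beta_true + rng.normal(scale=sigma, size=n)

# OLS estimate (4): beta_hat = (X'X)^{-1} X'Y.
# lstsq is used instead of an explicit matrix inverse for numerical stability.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Residual sum of squares (3) for the fitted model.
residuals = Y - X @ beta_hat
rss = residuals @ residuals
print(beta_hat, rss)
```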
In some cases, we may want to restrict the number of features in our model to a subset of q of them, including the intercept. In this case, we pretend that our X matrix contains only the relevant q columns when applying (4); the remaining m − q entries of the β̂ vector are set to 0. I’ll denote the resulting estimate by β̂q, and the RSS for that model by

    RSSq := (Y − Xβ̂q)′(Y − Xβ̂q).    (5)

1.1 Penalized Regression

In many cases, regression problems have large numbers of potential features. For instance, [FS04] predicted credit-card bankruptcy using a model with more than m = 67,000 potential features. In bioinformatics applications, it is common to have thousands or tens of thousands of features for, say, the type of each of a number of genetic markers or the expression levels of each of a number of gene transcripts. The number of observations, in contrast, is typically a few hundred at best. The OLS estimate (4) breaks down when m > n, since the m × m matrix X′X has rank at most n and is therefore not invertible. Moreover, it’s implausible that a given response is actually linearly related to such a large number of features; a model β̂ with many nonzero entries is probably overfitting the training data.

The statistics and machine-learning communities have developed a number of approaches for addressing this problem. One of the most common is regularized regression, which aims to minimize not (3) directly but a penalized version of the residual sum of squares:

    (Y − Xβ)′(Y − Xβ) + λ‖β‖p,    (6)

where ‖β‖p represents the ℓp norm of β and λ is a tunable hyperparameter, to be determined by cross-validation or more sophisticated regularization path approaches.¹

Ridge regression takes the penalty as proportional to the (square of the) ℓ2 norm: λ‖β‖2². This corresponds to a Bayesian maximum a posteriori estimate for β under a Gaussian prior β ∼ N(0m×1, (σ²/λ)Im×m), with σ² as in (1) [Bis06, p. 153]. Under this formulation, we have

    β̂ = (X′X + λIm×m)⁻¹X′Y,

which is computationally valid because X′X + λIm×m is invertible for λ > 0. However, because squaring a number less than one makes it much smaller still, the ℓ2 norm offers little incentive to drive entries of β̂ to 0; many of them just become very small.

Another option is to penalize by the ℓ1 norm, which is known as lasso regression and is equivalent to a double-exponential prior on β [Tib96]. Unlike ℓ2, ℓ1 regularization doesn’t square the coefficients, and hence the entries of β̂ tend to be sparse. In this way, ℓ1 regression can be seen as a form of feature selection, i.e., choosing a subset of the original features to keep in the model (e.g., [BL97]). Sparsity helps to avoid overfitting to the training set; as a result, the number of training examples required for successful learning with ℓ1 regularization grows only logarithmically with the number of irrelevant features, whereas this number grows linearly for ℓ2 regression [Ng04]. Sparse models also have the benefit of being more interpretable, which is important for scientists who want to know which particular variables are actually relevant for a given response. Building regression models for interpretation is further discussed in Sections 3 and 4.
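To illustrate the sparsity contrast between the ℓ1 and ℓ2 penalties described above, the following sketch (again an illustration, not from the thesis) fits lasso and ridge models with scikit-learn on simulated data in which only a few features are truly relevant; the penalty strengths alpha=0.1 and alpha=1.0 are arbitrary choices rather than tuned values.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, m = 100, 50
X = rng.normal(size=(n, m))
beta_true = np.zeros(m)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]   # only 5 truly relevant features
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

# The l1 penalty (lasso) tends to set irrelevant coefficients exactly to zero;
# the l2 penalty (ridge) merely shrinks them toward zero.
lasso = Lasso(alpha=0.1).fit(X, Y)
ridge = Ridge(alpha=1.0).fit(X, Y)

print("nonzero lasso coefficients:", np.sum(lasso.coef_ != 0))
print("nonzero ridge coefficients:", np.sum(ridge.coef_ != 0))
```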
If ℓ2 regression fails to achieve sparsity because coefficients are squared, then, say, ℓ1/2 regression should achieve even more sparsity than ℓ1. As p approaches 0, ‖β‖p approaches the number of nonzero values in β. Hence, regularization with what is called the “ℓ0 norm” is subset selection: choosing a small number of the original features to retain in the regression model. Once a coefficient is in the model, there’s no incentive to drive it to a small value; all that counts is the cost of adding it in the first place. The ℓ0 norm has a number of advantages [LPFU08], including bounded worst-case risk with respect to the ℓ1 norm and better control of a measure called the “false discovery rate” (FDR), explained more fully in Section 3. Moreover, as [OTJ09, p. 1] note, “A virtue of the [ℓ0] approach is that it focuses on the qualitative decision as to whether a covariate is relevant to the problem at hand, a decision which is conceptually distinct from parameter estimation.” However, they add, “A virtue of the [ℓ1] approach is its computational tractability.” Indeed, exact ℓ0 regularization requires subset search, which has been proved NP-hard [Nat95]. In practice, therefore, an approximate greedy algorithm like forward stepwise selection is necessary [BL05, p. 582].

In a regression model, the residual sum of squares is, up to an additive constant, proportional to the negative log-likelihood of β. Therefore, ℓ0 regularization can be rephrased as a penalized likelihood criterion (with a different λ than in (6)):

    −2 ln P(Y | β̂q) + λq,    (7)

¹ See, e.g., [Fri08] for an excellent introduction.
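Because exact ℓ0 search is NP-hard, a greedy forward stepwise procedure of the kind mentioned above is the practical approximation. The sketch below is an illustration under stated assumptions, not the thesis's MIC coding scheme: it adds at each step the feature that most reduces a criterion of the form (7), scoring the −2 log-likelihood as n·ln(RSSq/n) under Gaussian noise with the variance profiled out and constants dropped, and the penalty lam = 2·ln(m) is just one conventional per-feature cost.

```python
import numpy as np

def forward_stepwise(X, Y, lam):
    """Greedy approximation to l0-penalized regression using a criterion like (7).

    A feature subset S is scored by n*ln(RSS_S/n) + lam*|S|, i.e. the
    -2 log-likelihood under Gaussian noise (variance profiled out,
    constants dropped) plus a per-feature penalty lam.
    """
    n, m = X.shape
    selected = [0]                      # always keep the intercept column

    def score(cols):
        beta, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
        rss = np.sum((Y - X[:, cols] @ beta) ** 2)
        return n * np.log(rss / n) + lam * len(cols)

    current = score(selected)
    improved = True
    while improved:
        improved = False
        best_j, best_score = None, current
        for j in range(m):
            if j in selected:
                continue
            s = score(selected + [j])
            if s < best_score:
                best_j, best_score = j, s
        if best_j is not None:
            selected.append(best_j)
            current = best_score
            improved = True
    return selected

# Example usage on simulated data with an intercept column:
rng = np.random.default_rng(2)
n, m = 100, 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])
beta_true = np.zeros(m)
beta_true[[0, 3, 7]] = [1.0, 2.0, -1.5]
Y = X @ beta_true + rng.normal(scale=0.5, size=n)
print(forward_stepwise(X, Y, lam=2 * np.log(m)))
```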
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
Journal: CoRR
Volume: abs/0906.0052
Publication date: 2009